Introduction to R

Douwe Molenaar

Systems Biology Lab

2024-11-01

The course

Teaching

  • Online syllabus
    • Technical part: working with R
    • Applications: Probability theory, Statistical modeling, Causal analysis
  • Lectures: on theoretical background
  • Walk-in hours:
    • Technical help during Walk-in hours, also during assignments!

Assessment

  • Assignments
    • Expected technical knowledge, see Canvas Modules \(\rightarrow\) Assignments
    • Expected theoretical knowledge: depends on assignment (see schedule in syllabus)
  • Exam
    • Expected knowledge: theoretical background from the syllabus,
    • Linear algebra from the Math syllabus

Meeting R

Why use R

  • It is relatively easy to program. More data- and statistics- oriented than Python
  • A very well maintained Package repository: if a statistical technique exists, bet that there is an R-package
  • R has a very active community, which means quality control, efficient bug tracking and fixing, and regular updates
  • It’s relatively easy to create complex graphs
  • Literate programming: straight from data analysis to publication

In the course Biosystems Data Analysis (P3) you will use R again extensively.

Graphical display of data using R

Technical introduction

The anatomy of an R-expression

log(2.71)
[1] 0.9969486
  • log(2.71): an expression
  • log(): a function or method
  • 2.71: an argument of the function log()
  • 0.9969486: the output of the expression
  • 2.71, 0.9969486: are objects (of class numeric)

The anatomy of another R-expression

1 + 2
[1] 3
  • 1 + 2: an expression
  • '+'(): a function or method
  • 1 and 2: are arguments of the function '+'()
  • 1 and 2: are objects (of class numeric)

Alternative expression with the same result:

'+'(1,2)
[1] 3

+ is a function with two arguments. It can be written as a normal function '+'(1,2) or as an infix operator 1 + 2.

R is an object oriented language

  • All objects belong to a class
  • Classes are organized in a class hierarchy
  • Objects have associated methods

Consequences of object oriented approach

  • For the programmer: easier programming and maintenance
  • For the user: to work efficiently with R you have to know the hierarchical relations between the basic object classes

Example

sex <- factor(c("M","F","F","M","M","M"))
sex
[1] M F F M M M
Levels: F M


as.numeric(sex)
[1] 2 1 1 2 2 2

Object class hierarchy in R

Note that these are the basic classes. Packages may introduce new child classes of basic classes.

Atomic data object classes

Data objects in R

Every data object in R is derived from the vector class

  • vector is an abstract class in R, representing a tuple, a set of containers that carry an index number:

image/svg+xml 1 2 3 n index: containers:

  • You can put something in these containers: a single number 211.14, a character “a”, a string “Abc”, …
  • These containers are called elements of the vector

Atomic vectors

In an atomic vector:

  • each element holds a single atom
  • all elements contain the same type of atom

Atom types are limited to three classes:

  • numeric: numbers
  • character: character strings
  • logical: logical values (TRUE, FALSE)

numeric vector: every element contains a single number


1
[1] 1


seq(1,5)
[1] 1 2 3 4 5


1:5
[1] 1 2 3 4 5


rep(c(1,3),5)
 [1] 1 3 1 3 1 3 1 3 1 3


rnorm(5)
[1] -0.28206602 -0.10055451 -0.03093558 -1.47619504 -0.74812609

Quiz: functions and arguments

In the expressions seq(1,5) and 1:5

  • What are the functions/methods?
  • What is/are the argument(s)?
  • What is the class of the argument(s)?
  • What is the class of the output of the functions?

character vector: each element holds a single character string


"a"
[1] "a"


'Abc'
[1] "Abc"


letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"


letters is a pre-defined character vector in R. There is also the pre-defined vector LETTERS.

logical vector: each element holds a single logical value


1==1
[1] TRUE


1==2
[1] FALSE


logical vectors are usually calculated using logical functions like ==, !=, >, < etc.:


c('male','female','female','male') == 'male'
[1]  TRUE FALSE FALSE  TRUE

Creating a vector

Use the concatenation function c() to create vectors with arbitrary data

c(15.3,11.4,12.5,13.1)
[1] 15.3 11.4 12.5 13.1


c('a','b','c')
[1] "a" "b" "c"


c('a','b','sample')
[1] "a"      "b"      "sample"

Quiz: functions, input and output classes

In the expression c('a','b','sample')

  • How many arguments does the function c() take?

  • What is the class of the argument(s)?

  • What is, in general, the class of the output of the function c()?

  • How many arguments does the outer concatenation in c(c('a','b'),'sample') take? What will be the output?

Calculating with atomic vectors

Two types of base functions in R

All functions have vectors as arguments. How do they operate on vectors?

Functions that operate element-wise, with recycling of the shortest vector:

c(1,2,3) + c(5,6,7)
[1]  6  8 10
c(1,2,3,4) + c(1,2)
[1] 2 4 4 6
c(1,2,3) + 1
[1] 2 3 4
c(1,2,3) > 2
[1] FALSE FALSE  TRUE


Functions that operate on an entire vector:

mean(c(1,2,3))
[1] 2
sd(c(1,2,3))
[1] 1
sum(c(1,2,3))
[1] 6
any(c(FALSE, FALSE, FALSE, TRUE, FALSE))
[1] TRUE

Storing objects in memory

Giving an object a name with the assignment operator <- creates a semi-permanent copy in memory

mydata <- c(15.3,11.4,12.5,13.1)


mydata
[1] 15.3 11.4 12.5 13.1


mean(mydata)
[1] 13.075


sd(mydata)
[1] 1.641899

Generating and storing a vector with random numbers

sample <- runif(100)


sample
  [1] 0.321652743 0.255329920 0.813072467 0.844754566 0.741460668 0.452447160
  [7] 0.660586335 0.884833407 0.376626137 0.396120389 0.203076562 0.613253285
 [13] 0.755529126 0.082694459 0.340706577 0.261525724 0.256461629 0.236181747
 [19] 0.477760768 0.450746441 0.475976889 0.247493265 0.983040848 0.634353817
 [25] 0.516292815 0.764290652 0.625141596 0.249894288 0.612004598 0.850249073
 [31] 0.996369231 0.276770240 0.553375643 0.425778985 0.279011318 0.575475608
 [37] 0.547314094 0.776472423 0.786281534 0.458755063 0.392587437 0.984401637
 [43] 0.952639495 0.609412803 0.075744548 0.792524322 0.748155368 0.870700320
 [49] 0.610821505 0.919262931 0.037285194 0.157941364 0.565128270 0.973846421
 [55] 0.337083640 0.220714435 0.775216294 0.264685093 0.834163542 0.709327108
 [61] 0.073919139 0.942226410 0.581857634 0.656319579 0.583717612 0.630195420
 [67] 0.287243839 0.524404053 0.693821546 0.727689154 0.184470702 0.592739058
 [73] 0.511415237 0.135698926 0.551563792 0.117924483 0.231563725 0.218273374
 [79] 0.393647147 0.320536121 0.083605645 0.269399869 0.679423194 0.937780983
 [85] 0.871313597 0.787220405 0.879004461 0.470023791 0.702892255 0.409319881
 [91] 0.632400250 0.077210962 0.356680507 0.142912854 0.976355729 0.033196826
 [97] 0.001373154 0.676875147 0.427970793 0.778332553

Real life: missing data

Elements of any atomic type of vector can have the value NA: not available

  • NA is not the same as 0 or ""

Operations on objects containing NA may or may not yield NA:

mydata <- c(15.3,11.4,NA,13.1)
mydata + 1
[1] 16.3 12.4   NA 14.1


mean(mydata)
[1] NA


mean(mydata, na.rm=TRUE)
[1] 13.26667

Real life: avoiding program exit upon numerical errors

  • Inf, -Inf: resulting from division by 0 or from calculations beyond the numeric capacity of the computer
mydata <- c(15.3,11.4,0,13.1)


1/mydata
[1] 0.06535948 0.08771930        Inf 0.07633588


  • NaN: resulting from numerical operations leading to an undefined number
mydata <- c(15.3,11.4,0,13.1)


0/mydata
[1]   0   0 NaN   0

Quiz: simulate 100 rolls with a fair die

  • create one or two R-statements
  • required output: a vector of 100 random numbers from the set 1..6
  • all numbers must have equal probability to be drawn
  • use, for example
    • sample() or sample.int()
    • help() for these functions

Composite object classes

factor: each element holds a single discrete value

factor: a vector with a limited vocabulary

gender <- factor(c("Male", "Female", "Female", "Male"))


gender
[1] Male   Female Female Male  
Levels: Female Male


class(gender)
[1] "factor"


The factor class is derived from the integer class.


as.numeric(gender)
[1] 2 1 1 2

image/svg+xml vector list array matrix data.frame character logical numeric integer   factor ordered

array: a vector with an attribute

array: consists of an atomic vector. Special case: matrix, a 2-dimensional array

v <- 1:4


class(v)
[1] "integer"


attributes(v)
NULL


mymat <- matrix(v, ncol=2)


mymat
     [,1] [,2]
[1,]    1    3
[2,]    2    4


attributes(mymat)
$dim
[1] 2 2

data.frame: the equivalent of a table

  • Elements in a data.frame correspond to columns
  • The elements contain vectors
  • Vectors inside the elements of a data.frame must all have equal length
d <- data.frame(
  gender = factor(c("M","F","F", "M")),
  weight = c(69.1,56.4,61.0,81.5)
)


d
  gender weight
1      M   69.1
2      F   56.4
3      F   61.0
4      M   81.5


d$weight
[1] 69.1 56.4 61.0 81.5


mean(d$weight)
[1] 67

Methods: statistics made easier

plot(d, col='steelblue')

t.test(weight~gender, data=d)

    Welch Two Sample t-test

data:  weight by gender
t = -3.4717, df = 15.944, p-value = 0.00316
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -17.014143  -4.110979
sample estimates:
mean in group Female   mean in group Male 
            67.44078             78.00334 

Two equivalent ways to take an object out of an element

  1. The $ operator works with named indexes:
d$weight
 [1] 72.93805 91.13514 73.61649 64.29421 84.09792 71.75040 81.99970 73.98459
 [9] 82.68189 83.53499 68.73414 73.21086 66.80506 69.45581 66.43022 72.51171
[17] 60.32600 73.96749 65.76018 57.20630
  1. The [[() function works with numeric and named indexes:
d[['weight']]
 [1] 72.93805 91.13514 73.61649 64.29421 84.09792 71.75040 81.99970 73.98459
 [9] 82.68189 83.53499 68.73414 73.21086 66.80506 69.45581 66.43022 72.51171
[17] 60.32600 73.96749 65.76018 57.20630


d[[2]]
 [1] 72.93805 91.13514 73.61649 64.29421 84.09792 71.75040 81.99970 73.98459
 [9] 82.68189 83.53499 68.73414 73.21086 66.80506 69.45581 66.43022 72.51171
[17] 60.32600 73.96749 65.76018 57.20630

R-Packages

R packages

  • A package is a coherent set of
    • class definitions
    • functions operating on these classes
    • examples of the use of these classes
  • These are packaged in a single file that, after installation of the package, results in a library folder with several files
    • Look into the library subdirectory of the R installation directory (but don’t mess with it)

R packages

  • A number of packages is loaded at start-up of R
    • They contain definitions of commonly used data structures and functions
  • Other packages are loaded on-demand by the user with functions
    • library(packagename)
    • require(packagename)

Examples of R-packages

  • Loaded at start-up
    • base: all basic classes and functions used in R, including basic statistical functions
    • stats: classes and functions for more advanced statistical operations
  • Loaded on demand
    • edgeR: classes and functions for RNAseq analysis
    • ggplot2: classes and functions for advanced graphics

Old and new approaches

Graphical systems

The old way (plot)

show code
plot(x=iris$Petal.Length, y=iris$Sepal.Length, 
     asp=1,
     xlab="Petal length (cm)", ylab="Sepal length (cm)", 
     col=c('maroon','plum3','navyblue')[iris$Species],
     pch = 19
     )
lines(x=c(1,7),y=c(4.31+0.41*1, 4.31+0.41*7), lwd=2, col='red')
legend(1.2, 8,
       pch=19,
       col=c('maroon','plum3','navyblue'), 
       legend=levels(iris$Species))

Good enough for “quick plots”

The tidyverse way

show code
ggplot(data=iris, mapping=aes(x=Petal.Length, y=Sepal.Length, colour=Species)) +
  labs(x='Petal length (cm)', y='Sepal length (cm)') +
  stat_smooth(method=lm, se=FALSE, colour="red") +
  geom_point(size=3) + 
  theme(legend.position=c(0.05,0.95), legend.justification=c(0,1)) +
  scale_color_manual(values=c('maroon','plum3','navyblue'))

Publication quality graphs

Data processing

Example: Sepal width and length means and standard deviations per species

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.8         2.7          4.1           1 versicolor
2            6.4         2.8          5.6         2.1  virginica
3            4.4         3.2          1.3         0.2     setosa
..            ..          ..           ..          ..         ..
150          6.1         2.8            4         1.3 versicolor


The old way

show code
means <- do.call(rbind, lapply(iris[,c("Sepal.Length", "Sepal.Width")], function(x) {tapply(x, iris$Species, mean)}))
sds <- do.call(rbind, lapply(iris[,c("Sepal.Length", "Sepal.Width")], function(x) {tapply(x, iris$Species, sd)}))
rownames(means) <- paste0(rownames(means), "_avg")
rownames(sds) <- paste0(rownames(sds), "_sd")
df <- as.data.frame(rbind(means,sds))
df
                    setosa versicolor virginica
Sepal.Length_avg 5.0060000  5.9360000 6.5880000
Sepal.Width_avg  3.4280000  2.7700000 2.9740000
Sepal.Length_sd  0.3524897  0.5161711 0.6358796
Sepal.Width_sd   0.3790644  0.3137983 0.3224966

The tidyverse way

show code
iris |>
  group_by(Species) |>
  select(starts_with("Sepal")) |>
  summarise(across(where(is.numeric), list(avg=mean, sd=sd))) |>
  flextable()

Species

Sepal.Length_avg

Sepal.Length_sd

Sepal.Width_avg

Sepal.Width_sd

setosa

5.006

0.3524897

3.428

0.3790644

versicolor

5.936

0.5161711

2.770

0.3137983

virginica

6.588

0.6358796

2.974

0.3224966

Integrating text and code

Document writing is integrated with code writing

Several flavors and purposes

  • Literate programming: focus on programming logic
  • Documentation generation: focus on program documentation
  • RMarkdown, Quarto: focus on scientific and technical document generation

Quarto documents


The weaving process

image/svg+xml pandoc knitr quarto Demoproject.qmd Markdown Demoproject.md HTML, PDF, ... Demoproject.html

Programming

Functions

Functions do things with the input

  • Input: function arguments (can be one or several objects)
  • Output: result of a function (exactly one object)

R contains many pre-defined functions, you saw mean, sd, sum, t.test

You can define your own functions:

add <- function(a,b) {
  result = a+b
  return(result)
}


add(4,3)
[1] 7

Programming

Branching

if (a > b) {
  print(a - b)
} else {
  print(b - a)
}


Looping

for (i in 1:10) {
   print(i*i)
}
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
[1] 49
[1] 64
[1] 81
[1] 100